Assignment 5

Chad Pickering, 913328497

Corresponded with: Edie Espejo, Patrick Vacek, Graham Smith, Ricky Safran, Nivi Achanta, Sierra Tevlin, Hannah Kosinovsky, Janice Luong

Resources: A variety of package documentation. Lost to the void. It's 5:10am. I can't remember anymore.

Instructions: In this assignment, you'll scrape text from The California Aggie and then analyze the text.

The Aggie is organized by category into article lists. For example, there's a Campus News list, Arts & Culture list, and Sports list. Notice that each list has multiple pages, with a maximum of 15 articles per page.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.

Exercise 1.1. Write a function that extracts all of the links to articles in an Aggie article list. The function should:

  • Have a parameter url for the URL of the article list.

  • Have a parameter page for the number of pages to fetch links from. The default should be 1.

  • Return a list of article URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:

  • Be polite to The Aggie and save time by setting up requests_cache before you write your function.

  • Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.

  • You can use lxml.html or BeautifulSoup to scrape HTML. Choose one and use it throughout the entire assignment.

In [1]:
# Import packages (Python 2):

import functools
import random
import re
from collections import Counter
from itertools import izip
from urllib2 import Request, urlopen

import lxml
import nltk
import numpy as np
import pandas as pd
import requests
import wordcloud
from bs4 import BeautifulSoup
from fastcache import clru_cache
from matplotlib import pyplot as plt
from nltk import corpus, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import KNeighborsRegressor, NearestNeighbors
from wordcloud import WordCloud, STOPWORDS

plt.style.use('ggplot')
%matplotlib inline

1.1 Answer: The following two functions extract links to articles from an Aggie article list. The first, one_page(), fetches and parses a single URL. The second, mult_pages(), calls one_page() on each requested list page and returns the article URLs found on those pages as one list.

In [2]:
# Retrieves and parses one url:
@clru_cache(maxsize=128, typed=False)
def one_page(url):
    url_input = urlopen(url)
    bs_parsed = BeautifulSoup(url_input, "html.parser")
    return bs_parsed
In [3]:
# Retrieves article urls from multiple list pages using one_page() (1 page by default):
@clru_cache(maxsize=128, typed=False)
def mult_pages(url, pages=1):
    # Fetch and parse each requested list page
    page_list = [one_page(url + "/page/" + str(i) + "/") for i in range(1, pages + 1)]

    # For each "h2" tag on each page, take the href of the element that follows it (the article link)
    url_list = [[text.findNext().get("href") for text in page.findAll("h2")] for page in page_list]

    # Flatten the per-page lists into one list of urls, without differentiating per page
    urls = [link for page_links in url_list for link in page_links]

    return urls
In [4]:
# Test function on the arts list with 2 pages; the sports and campus tests below were also run and passed
# mult_pages("https://theaggie.org/sports", pages = 3)
# mult_pages("https://theaggie.org/campus", pages = 3)
mult_pages("https://theaggie.org/arts", pages = 2)
Out[4]:
[u'https://theaggie.org/2017/02/23/sacramentos-artstreet-exhibit-showcases-diverse-artwork/',
 u'https://theaggie.org/2017/02/23/armadillo-kdvs-collaborate-to-host-vinyl-and-music-fair/',
 u'https://theaggie.org/2017/02/23/harlows-nightclub-presents-khalid/',
 u'https://theaggie.org/2017/02/21/late-night-eats-in-davis/',
 u'https://theaggie.org/2017/02/21/2017-oscar-nominations-and-predictions/',
 u'https://theaggie.org/2017/02/20/student-sounds-samantha-sipin/',
 u'https://theaggie.org/2017/02/20/tv-revisited-the-office/',
 u'https://theaggie.org/2017/02/19/uc-davis-theater-and-dance-presents-its-newest-comedy/',
 u'https://theaggie.org/2017/02/19/twenty-one-pilots-emotional-roadshow-world-tour/',
 u'https://theaggie.org/2017/02/16/a-night-under-the-stars/',
 u'https://theaggie.org/2017/02/14/critically-acclaimed-stand-up-comic-brian-regan-to-perform-at-mondavi-center/',
 u'https://theaggie.org/2017/02/13/burning-love-parody-of-the-bachelor/',
 u'https://theaggie.org/2017/02/12/the-bachelor-engages-viewers-prompts-viewing-rituals/',
 u'https://theaggie.org/2017/02/12/seven-movies-from-the-seven-countries-targeted-by-president-trumps-muslim-ban/',
 u'https://theaggie.org/2017/02/12/the-best-of-times-and-the-worst-of-times-a-valentines-day-playlist/',
 u'https://theaggie.org/2017/02/09/events-calendar-for-february/',
 u'https://theaggie.org/2017/02/09/hopped-up-comedy-x-arrives-in-time-for-valentines-day-weekend/',
 u'https://theaggie.org/2017/02/09/lantern-festival-the-beginning-of-the-year/',
 u'https://theaggie.org/2017/02/07/top-three-places-to-eat-when-hungover/',
 u'https://theaggie.org/2017/02/06/dance-dance/',
 u'https://theaggie.org/2017/02/06/skams-universal-appeal/',
 u'https://theaggie.org/2017/02/06/t-v-revisited-breaking-bad/',
 u'https://theaggie.org/2017/02/02/headline-a-look-inside-kdvs/',
 u'https://theaggie.org/2017/02/02/spoken-word-allows-for-expression-of-much-more-than-words/',
 u'https://theaggie.org/2017/02/02/the-self-proclaimed-cure-for-racism/',
 u'https://theaggie.org/2017/02/02/a-feast-for-the-ears-just-in-time-for-lunch/',
 u'https://theaggie.org/2017/01/30/pronoun-highlights-struggles-of-transgender-youth/',
 u'https://theaggie.org/2017/01/30/through-the-artists-eye-sammy-sanchez-monter/',
 u'https://theaggie.org/2017/01/29/the-books-that-inspire-uc-davis-faculty/',
 u'https://theaggie.org/2017/01/29/an-earful-introducing-andy-shauf/']
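
The paging and flattening steps in mult_pages() can be checked without touching the network. The following is a minimal sketch, assuming the "/page/N/" URL scheme used above; page_urls() and flatten() are hypothetical helper names, not part of the assignment.

```python
def page_urls(url, pages=1):
    """Build the list-page URLs to request (assumes the '/page/N/' scheme)."""
    base = url.rstrip("/")
    return ["{}/page/{}/".format(base, i) for i in range(1, pages + 1)]


def flatten(nested):
    """Collapse a list of per-page link lists into one flat list, preserving order."""
    return [link for page_links in nested for link in page_links]


urls = page_urls("https://theaggie.org/arts", pages=2)
links = flatten([["a", "b"], ["c"]])  # placeholder strings stand in for article URLs
print(urls)
print(links)
```

Separating URL construction from fetching makes it easy to confirm that pages=2 requests exactly pages 1 and 2 before any request is sent.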

Exercise 1.2. Write a function that extracts the title, text, and author of an Aggie article. The function should:

  • Have a parameter url for the URL of the article.

  • For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.

  • Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.

For example, for this article your function should return something similar to this:

{
    'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
    'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" 
Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
    'title': 'Project Toto aims to address questions regarding city finances',
    'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'
}

Hints:

  • The author line is always the last line of the last paragraph.

  • Python 2 displays some Unicode characters as \uXXXX. For instance, \u201c is a left-facing quotation mark. You can convert most of these to ASCII characters with the method call (on a string)

    .translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })

    If you're curious about these characters, you can look them up on this page, or read more about what Unicode is.
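
As a small illustration of the hint above, the sketch below applies the same mapping to a short string. The sample string is made up for the example; the mapping is the one given in the hint.

```python
# Mapping from the hint: curly quotes -> straight quotes, ellipsis -> space
ASCII_MAP = {0x2018: 0x27, 0x2019: 0x27, 0x201C: 0x22, 0x201D: 0x22, 0x2026: 0x20}

# Hypothetical sample containing curly quotes and an ellipsis
sample = u"\u201cIt\u2019s a model\u2026\u201d"

cleaned = sample.translate(ASCII_MAP)
print(cleaned)
```

Unlike .encode('ascii', 'ignore'), which drops these characters entirely, translate() replaces them with ASCII equivalents, so apostrophes and quotation marks survive.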

In [5]:
url_test = "https://theaggie.org/2017/02/15/suspect-in-davis-islamic-center-vandalism-arrested/"
In [6]:
# Retrieves an article and extracts its title, text, and author:
@clru_cache(maxsize=128, typed=False)
def article_cont(url):
    # Fetch and parse the article page
    url_input = urlopen(url)
    bs_parsed = BeautifulSoup(url_input, "html.parser")

    ### Title ###

    try:
        title = bs_parsed.find_all("h1", {"class": "entry-title"})[0].text.encode('ascii', 'ignore')
    except Exception:
        # Fall back to the first bold text in the article body if the try fails
        try:
            title_exc = bs_parsed.find("div", {"itemprop": "articleBody"})
            title = title_exc.find_all("strong")[0].text.encode('ascii', 'ignore')
        except Exception:
            title = np.NaN

    ### Text ###

    # Locate the div that holds the body of the entry
    text_body = bs_parsed.find_all("div", {"class": "entry-content"})[0]

    # Get paragraphs, join as one body of text, and drop non-ASCII characters
    spans = text_body.find_all("span", {"style": "font-weight: 400;"})
    ind_paras = [span.text for span in spans]
    # The final span is dropped (it normally holds the author line)
    paras_all = "".join(ind_paras[:-1]).strip().encode('ascii', 'ignore')

    ### Author ###

    # Retrieve part of articleBody in which author is contained
    itemprop_body = str([item.text for item in bs_parsed.find_all("div") if item.get("itemprop") == "articleBody"])

    try:
        # flags must be passed by keyword; the third positional argument of re.split is maxsplit
        author = re.split("Written by: ", itemprop_body, flags=re.IGNORECASE)[1]
        # str() of a list shows non-ASCII characters as backslash escapes; keep the text before the first one
        author = re.split("\\\\", author.encode('ascii', 'ignore'))[0]
    except Exception:
        # Use regex to explicitly search for first and last name if try fails
        try:
            author_search = re.search("Written [Bb]y: [A-Za-z]{1,20} [A-Za-z]{2,20}", bs_parsed.text)
            author = re.split("Written [Bb]y: ", author_search.group(0))[1]
        except Exception:
            author = np.NaN

    ### Dictionary ###

    dic = {"author": author, "text": paras_all, "title": title, "url": url}
    return dic
In [7]:
article_cont(url_test)
Out[7]:
{'author': 'Samantha Solomon ',
 'text': 'Police arrested a Lauren Kirk-Coehlo, resident of Davis and graduate of Davis High School, on the morning of Feb. 14 as suspect in the Islamic Center of Davis vandalism case, which investigators and state and federal prosecutors have labeled a hate crime. The arrest comes after nearly a month of joint investigation by the Davis Police Department (DPD) and the FBI. The UC Davis issued a crime alert soon after the arrest stating, Soon after the crime was reported, and the surveillance footage was released, the Police Department received numerous tips regarding the vandalism. Kirk-Coehlo is currently booked in the Yolo County jail for felony vandalism with hate crime enhancement. The suspect faces up to six year in prison if she is convicted, and bail has been set at $1 million. Kirk-Coelhos arraignment hearing is set for Feb. 16 at 1:30 p.m. The vandalism of the Islamic Center occurred on the morning of Jan. 22 during which an estimated $7,000 worth of damage was inflicted. The incident was caught on a surveillance camera from the mosque. Video footage shows a female suspect smashing six window panes and placing something on the exterior door handle of the Islamic Center of Davis. It was later determined that strips of bacon were placed on the door handle, said Jonathan Raven, chief deputy district attorney in a press release.Shortly after the footage was released, The Sacramento Valley chapter of the Council on American-Islamic Relations (CAIR-SV) called on state and federal law enforcement to investigate the motive behind the vandalism. Political, religious or ideological beliefs are not an excuse to commit hate crimes, said Monica Miller, special agent in charge of the Sacramento FBI office in an interview with the Sacramento Bee after the arrest.  Members of the mosque have since rallied together, and with help from the Davis community, raised $20,000 dollars for repairs. 
On behalf of the Muslim community in Davis, we would like to thank you for your contribution to help repair our Masjid, said Omar Awad, UC Davis Muslim Student Association president and Shifa Community Clinic volunteer, on the fundraiser page',
 'title': 'Suspect in Davis Islamic Center vandalism arrested',
 'url': 'https://theaggie.org/2017/02/15/suspect-in-davis-islamic-center-vandalism-arrested/'}
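
The author step can also be tested on a single byline string. The sketch below is one possible approach, assuming the byline has the shape shown in the Exercise 1.2 example ("Written By: name, then an em dash and an email address"); extract_author() and AUTHOR_RE are hypothetical names introduced for illustration.

```python
import re

# Assumed byline shape: "Written by: <name>", optionally followed by an em dash and an email
AUTHOR_RE = re.compile(
    u"Written by:\\s*(.+?)\\s*(?:\u2014|$)",
    re.IGNORECASE | re.UNICODE,
)


def extract_author(text):
    """Return the name after 'Written by:' up to an em dash or the end of the text, else None."""
    match = AUTHOR_RE.search(text)
    return match.group(1).strip() if match else None


# Byline taken from the example dictionary in Exercise 1.2
byline = u"Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org"
author = extract_author(byline)
print(author)
```

Working on the unicode text directly avoids relying on how str() renders non-ASCII characters, and returning None makes a missing byline explicit.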

Exercise 1.3. Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 Campus News articles and a data frame of 60 City News articles. Add a column to each that indicates the category, then combine them into one big data frame.

The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.

In [8]:
# Campus and city articles (4 list pages per category, up to 15 articles per page)
campus_pages = mult_pages("https://theaggie.org/campus", pages = 4)
city_pages = mult_pages("https://theaggie.org/city", pages = 4)
combined_pages = campus_pages + city_pages

all_cont = [article_cont(l) for l in combined_pages]
aggie_df = pd.DataFrame(all_cont)

# Create source column; repeat counts come from the url lists rather than a hard-coded 60
cat = np.array(["campus", "city"])
aggie_df["source"] = np.repeat(cat, [len(campus_pages), len(city_pages)], axis=0)
aggie_df
Out[8]:
author text title url source
0 Alyssa Vandenberg Current ASUCD Vice President Abhay Sandhu anno... 2017 Winter Quarter election results https://theaggie.org/2017/02/24/2017-winter-qu... campus
1 Aaron Liss and Raul Castellanos The University of California (UC) will retract... University of California, Davis City Council s... https://theaggie.org/2017/02/23/university-of-... campus
2 Kimia Akbari In light of the recent executive order, univer... Academics unite in peaceful rally against immi... https://theaggie.org/2017/02/23/academics-unit... campus
3 Kenton Goldsby Students have awaited the full reopening of th... Memorial Union to reopen Spring Quarter https://theaggie.org/2017/02/23/memorial-union... campus
4 Ivan Valenzuela Last month ASUCD President Alex Lee issued a s... ASUCD President Alex Lee vetoes amendment for ... https://theaggie.org/2017/02/23/asucd-presiden... campus
5 Alyssa Vandenberg Senate candidate Zaki Shaheen chose to publicl... Senate candidate Zaki Shaheen withdraws from race https://theaggie.org/2017/02/22/senate-candida... campus
6 Aaron Liss The UC Davis community recently received two c... UC Davis experiences several recent hate-based... https://theaggie.org/2017/02/21/uc-davis-exper... campus
7 Alyssa Vandenberg University of California (UC) President Janet ... UC President selects Gary May as new UC Davis ... https://theaggie.org/2017/02/21/uc-president-s... campus
8 Jeanna Totah Due to new policies implemented after the inve... Katehi controversy prompts decline of UC admin... https://theaggie.org/2017/02/20/katehi-controv... campus
9 Ivan Valenzuela On Jan. 26, ASUCD passed a new resolution to s... ASUCD Senate passes resolution submitting comm... https://theaggie.org/2017/02/20/asucd-senate-p... campus
10 Yvonne Leong The University of Californias 13th Annual Repo... UC releases 2016 Annual Report on Sustainable ... https://theaggie.org/2017/02/20/uc-releases-20... campus
11 Kenton Goldsby In a packed main lobby of the UC Davis Interna... UC Davis Global Affairs holds discussion on Pr... https://theaggie.org/2017/02/19/uc-davis-globa... campus
12 Kimia Akbari President Donald Trump signed an executive ord... Trumps immigration ban affects UC Davis community https://theaggie.org/2017/02/19/trumps-immigra... campus
13 Kaitlyn Cheung UC Davis students and community members protes... UC Davis students participate in UC-wide #NoDA... https://theaggie.org/2017/02/17/uc-davis-stude... campus
14 Jayashri Padmanabhan UC Davis held its first mental health conferen... UC Davis holds first mental health conference https://theaggie.org/2017/02/17/uc-davis-holds... campus
15 Demi Caceres The ASUCD Senate meeting was called to order b... Last week in Senate https://theaggie.org/2017/02/16/last-week-in-s... campus
16 Alyssa Vandenberg and Emilie DeFazio Executive: Josh Dalavai and Adilla JamaludinIn... 2017 ASUCD Winter Elections Meet the Candidates https://theaggie.org/2017/02/16/2017-asucd-win... campus
17 Ivan Valenzuela A new exhibit recently opened at Peter J. Shie... Shields Library hosts new exhibit for Davis ce... https://theaggie.org/2017/02/14/shields-librar... campus
18 Demi Caceres Students promote fruit and vegetable meals via... Student Health and Counseling Services hosts S... https://theaggie.org/2017/02/14/student-health... campus
19 Lindsay Floyd To compensate for a decrease in university-all... PE classes may charge additional fees https://theaggie.org/2017/02/13/pe-classes-may... campus
20 Jeanna Totah University News announced the names of the 11 ... 11 new Chancellor Fellows honored for 2016 https://theaggie.org/2017/02/12/11-new-chancel... campus
21 Aaron Liss The UC Davis Muslim Student Association (MSA) ... Muslim students respond to recent political ev... https://theaggie.org/2017/02/12/muslim-student... campus
22 Lindsay Floyd On Feb. 1, Student Health and Counseling Servi... Sexcessful Campaign launched in time for Valen... https://theaggie.org/2017/02/12/sexcessful-cam... campus
23 Alyssa Vandenberg Michael Chan, a fourth-year computer science m... Michael Chan sworn in as interim senator https://theaggie.org/2017/02/10/michael-chan-s... campus
24 Kenton Goldsby The Regents of the University of California (U... University of California Regents meet, approve... https://theaggie.org/2017/02/09/university-of-... campus
25 Yvonne Leong The ASUCD Senate meeting was called to order b... Last week in Senate https://theaggie.org/2017/02/09/last-week-in-s... campus
26 Jayashri Padmanabhan UC Davis received $2.2 million in state fundin... UC Davis receives $2.2 million from Assembly B... https://theaggie.org/2017/02/09/uc-davis-recei... campus
27 Ivan Valenzuela During the Davis College Democrats (DCD) secon... Senator Bill Dodd visits UC Davis https://theaggie.org/2017/02/06/senator-bill-d... campus
28 Kenton Goldsby One of the many new California laws that took ... AB 1887 prevents use of state funds, including... https://theaggie.org/2017/02/05/ab-1887-preven... campus
29 Jayashri Padmanabhan The University of California (UC) system annou... UC system hires Title IX coordinator https://theaggie.org/2017/02/02/uc-system-hire... campus
... ... ... ... ... ...
90 Raul Castellanos Jr. Sweet, salty, sour and spicy. Thai food has it... No such thing as too much Thai food https://theaggie.org/2017/01/17/no-such-thing-... city
91 Kaelyn Tuermer-Lee For the past four years, the Petco Foundation ... A dog named Disney wins grant money for Rotts ... https://theaggie.org/2017/01/16/a-dog-named-di... city
92 Andie Joldersma Good news is in the air for Yolo County reside... Yolo County Library materials to be more widel... https://theaggie.org/2017/01/15/yolo-county-li... city
93 Sam Solomon Dec. 24Downstairs neighbor tapping on RPs floo... Police Logs https://theaggie.org/2017/01/12/police-logs-7/ city
94 Dianna Rivera The Davis Manor Neighborhood (DMN) celebrated ... Neighbors unite https://theaggie.org/2017/01/12/neighbors-unite/ city
95 Kaelyn Tuermer-Lee Many believe that standardized testing is a me... Sparks fly in Light the Fire https://theaggie.org/2016/12/09/sparks-fly-in-... city
96 Raul Castellanos Jr. Davis, California home to thousands of cyclis... Bike Campaign offers bicycles to those who can... https://theaggie.org/2016/12/09/bike-campaign-... city
97 Juno Bhardwaj-shah On Nov. 19, the Davis Vanguard hosted its annu... Bail reform advocates gather at annual fundraiser https://theaggie.org/2016/12/08/bail-reform-ad... city
98 Samantha Solomon Celebrate the holidays 19th century style at O... Twas the Night Before Christmas in Old Sacramento https://theaggie.org/2016/12/07/twas-the-night... city
99 Sam Solomon Candles lit the streets of Davis on Dec. 1 as ... Childrens Candlelight Parade lights up downtow... https://theaggie.org/2016/12/05/childrens-cand... city
100 Sam Solomon Nov. 20Hydraulic lift parked across from RPs r... Police Logs https://theaggie.org/2016/12/04/police-logs-6/ city
101 Dianna Rivera The Yolo County Childrens Alliance (YCCA) held... The season of giving https://theaggie.org/2016/12/04/the-season-of-... city
102 Samantha Solomon As blue and red police lights wailed through t... NoDAPL protest erupts in downtown Davis https://theaggie.org/2016/12/02/nodapl-protest... city
103 Andie Joldersma Five, four, three, two, one Go!Those were the ... Davis Turkey Trot: more than just another race https://theaggie.org/2016/12/02/davis-turkey-t... city
104 Andie Joldersma With rising concerns of climate change, green ... Affordable, clean, green energy is coming to Y... https://theaggie.org/2016/11/30/affordable-cle... city
105 Bianca Antunez Since 1984, Davis residents have voted to supp... Measure H passes, voters support Davis schools https://theaggie.org/2016/11/29/measure-h-pass... city
106 Anya Rehon Overall trends of public intoxication arrests ... Local residents attend Davis town hall meeting https://theaggie.org/2016/11/29/local-resident... city
107 Sam Solomon Nov. 15Elderly driver not stopping at stop sig... Police Logs https://theaggie.org/2016/11/27/police-logs-5/ city
108 Raul Castellanos Jr. The public piano on the corner of Second and E... Public piano destroyed in act of vandalism https://theaggie.org/2016/11/27/public-piano-d... city
109 Samantha Solomon As darkness fell over the crowd of people who ... Holding the Light https://theaggie.org/2016/11/27/holding-the-li... city
110 Dianna Rivera It is 4:45 in the afternoon, and the sun is al... Sun down, bike lights out https://theaggie.org/2016/11/22/sun-down-bike-... city
111 Sam Solomon Nov. 7Subject stated our pizza is ready and th... Police Logs https://theaggie.org/2016/11/22/police-logs-4/ city
112 Kaelyn Tuermer-Lee One thing that Davis certainly doesnt have tre... Tune in to Watermelon Musics strings-for-food ... https://theaggie.org/2016/11/21/tune-in-to-wat... city
113 Andie Joldersma Water is sacred, water is life, chanted hundre... Water is sacred, water is life https://theaggie.org/2016/11/20/water-is-sacre... city
114 Anya Rehon The Yolo Food Bank (YFB) provides meals and ad... The Yolo Food Bank addresses food insecurity https://theaggie.org/2016/11/17/the-yolo-food-... city
115 Bianca Antunez As Election Day came to a close, and the resul... Nov. 8 2016: An Election Day many may never fo... https://theaggie.org/2016/11/17/nov-8-2016-an-... city
116 NaN Oct. 30Turkey in the roadway, vehicles stoppin... Police Logs https://theaggie.org/2016/11/15/police-logs-3/ city
117 Bianca Antunez For the eighth year in a row, the Yolo Food Ba... Yolo Food Banks eighth Annual Running of the T... https://theaggie.org/2016/11/15/yolo-food-bank... city
118 Raul Castellanos On the Monday before Election Day, energy ran ... Return of the Bern https://theaggie.org/2016/11/15/return-of-the-... city
119 Alana Joldersma Change is on its way to Davis Senior High Scho... Construction of the All Student Center at Davi... https://theaggie.org/2016/11/14/construction-o... city

120 rows × 5 columns
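
An alternative to building the category column by position is to label each record as it is collected, so that the label cannot drift out of step with the rows. This is only a sketch; label_records() is a hypothetical helper, and the small dictionaries stand in for the output of article_cont().

```python
def label_records(records_by_category):
    """Attach a 'source' key to each record, naming the category it was scraped from."""
    labeled = []
    for category, records in records_by_category:
        for record in records:
            row = dict(record)  # copy so the input dictionaries are not mutated
            row["source"] = category
            labeled.append(row)
    return labeled


# Hypothetical stand-ins for the dictionaries returned by article_cont()
campus = [{"title": "A"}, {"title": "B"}]
city = [{"title": "C"}]

rows = label_records([("campus", campus), ("city", city)])
print([r["source"] for r in rows])
```

The resulting list of dictionaries can be passed to pd.DataFrame() in the same way as all_cont above.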

Exercise 1.4. Use the Aggie corpus to answer the following questions. Use plots to support your analysis.

  • What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?

  • What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?

  • Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.

Hints:

  • The nltk book and scikit-learn documentation may be helpful here.

  • You can determine whether city articles are "near" campus articles from the similarity matrix or with k-nearest neighbors.

  • If you want, you can use the wordcloud package to plot a word cloud. To install the package, run

    conda install -c https://conda.anaconda.org/amueller wordcloud

    in a terminal. Word clouds look nice and are easy to read, but are less precise than bar plots.

What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?

Aggie topics in general:

Analysis: The following are word clouds of the most common terms in the titles, and then in the text bodies, of all articles scraped into the data frame. Recent events such as the protests are common in the titles, as are other political keywords and news about the new chancellor; titles are written to get readers' attention and to describe the content of the article succinctly. In the text, verbs and adverbs are more common: these are words that describe rather than attract immediate attention. Further, the distribution of word frequencies in the text bodies is much more skewed toward low frequencies because, as expected, a wider variety of words is used in the text than in the titles.

In [9]:
# Word cloud for all 120 articles:

categories = ["title", "text"]

stopwords = STOPWORDS
stopwords2 = set(["UC", "Davis", "new", "news", "police", "logs", "might", "also", "come", "don't", "student", "said", "will", "Yolo", "really", "going", "day", "students", "year", "city", "campus", "last", "week"])
stopwords = set(stopwords).union(stopwords2)

# Color: a random shade of gray for each word
def col_gray(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(50, 0%%, %d%%)" % random.randint(10, 50)

for cat in categories:
    # Join the column into one string and drop stopwords (case-sensitive match)
    rel_text = " ".join(aggie_df[cat]).split(" ")
    terms = [term for term in rel_text if term not in stopwords]
    terms = " ".join(terms)

    # Word cloud image:
    wc = WordCloud(background_color = "white", max_words=1000, stopwords=stopwords, width=800, height=400)
    wc.generate(terms)
    print(cat)

    # Plot:
    plt.figure(figsize=(20,10))
    plt.imshow(wc.recolor(color_func = col_gray, random_state=3))
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
title
text

Campus vs. city articles:

Analysis: I analyze titles instead of the text to determine whether campus and city articles cover different topics, because titles are a better source of the key words that describe topics. Campus articles appear to cover more political topics, whether national politics or ASUCD politics (senate, etc.). We see topics involving the new chancellor, sustainability, recent protests, and campus services. City articles, on the other hand, involve community topics such as food, public events, art, residential issues, and local campaigns. Political issues seem to overlap between the two categories, as do local and state events, which makes intuitive sense.

In [10]:
# Word cloud for campus and city titles:
sources = ["campus", "city"]

stopwords = STOPWORDS
stopwords2 = set(["UC", "Davis", "new", "news", "police", "logs", "might", "also", "come", "don't", "student", "said", "will", "Yolo", "really", "going", "day", "students", "year", "city", "campus", "last", "week"])
stopwords = set(stopwords).union(stopwords2)

# Color: a random shade of gray for each word
def col_gray(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(50, 0%%, %d%%)" % random.randint(10, 50)

for source in sources:
    # Join the titles for this source into one string and drop stopwords (case-sensitive match)
    rel_text = " ".join(aggie_df.loc[aggie_df['source'] == source]["title"]).split(" ")
    terms = [term for term in rel_text if term not in stopwords]
    terms = " ".join(terms)

    # Word cloud image:
    wc = WordCloud(background_color = "white", max_words=1000, stopwords=stopwords, width=800, height=400)
    wc.generate(terms)
    print(source)

    # Plot:
    plt.figure(figsize=(20,10))
    plt.imshow(wc.recolor(color_func = col_gray, random_state=3))
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
    
campus
city

Barplot of most common words in text:

Filtering out stop words, we see that the most common words relate directly to the community and to Davis, alongside generally common words one would find in any body of writing.

In [11]:
stopwords = STOPWORDS
stopwords2 = set(["UC", "Davis", "new", "news", "police", "logs", "might", "also", "come", "don't", "student", "said", "will", "Yolo", "really", "going", "day", "students", "year", "city", "campus", "last", "week"])
# Note: capitalized entries such as 'UC', 'Davis', and 'Yolo' do not filter below,
# because each word is lowercased before it is compared against this set
stopwords = set(stopwords).union(stopwords2)

# Tokenize every article, flatten into one list of words, and drop stopwords
w0 = [re.findall(r'\w+', text) for text in aggie_df['text']]
w1 = [item for sublist in w0 for item in sublist]
wordz = [word for word in w1 if word.lower() not in stopwords]

# Keep the 20 most frequent terms
termfreq_df = pd.DataFrame(Counter(wordz).most_common(20), columns=['term', 'freq'])

lbls = list(termfreq_df['term'])
freqs = list(termfreq_df['freq'])
indexes = np.arange(len(termfreq_df))

plt.bar(indexes, freqs, align = 'center', alpha = 0.5)
plt.xticks(indexes, lbls, rotation=75, fontsize = 9)
plt.ylabel('Frequency of term')
plt.title('Most Common Terms in the Text of Articles Scraped')

plt.show()
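
The filter in the cell above lowercases each word but compares it against a stopword set that contains capitalized entries such as "UC" and "Davis", so those entries never match. A minimal sketch of a case-insensitive count follows; top_terms() is a hypothetical helper and the three-sentence corpus is made up for the example.

```python
import re
from collections import Counter


def top_terms(texts, stopwords, n=20):
    """Count lowercase word tokens across texts, skipping stopwords case-insensitively."""
    stop = {w.lower() for w in stopwords}
    counts = Counter(
        word
        for text in texts
        for word in re.findall(r"\w+", text.lower())
        if word not in stop
    )
    return counts.most_common(n)


# Hypothetical mini-corpus
docs = ["UC Davis students protest", "Students at UC Davis vote", "Protest downtown"]
result = top_terms(docs, stopwords={"UC", "Davis", "at"}, n=2)
print(result)
```

Lowercasing both sides of the comparison also merges "Students" and "students" into one term, which would change the counts shown in the bar plot.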

What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?

Analysis: The similarity scores are stored in a sparse matrix, where the score for a pair of articles is based on the words the two articles share and how frequently those words appear. This method is flawed because article length directly affects the score: the TF-IDF vectors are not normalized (norm=None), so nothing accounts for the number of words in each article. As a result, the similarity score between two different articles is sometimes higher than the score between one of those articles and itself, simply because one article is much longer than the other. Despite this, the top three pairs under this measure are shown, and 20 terms shared within each pair are displayed in a data frame below. The following code computes the similarity scores for all pairs of articles, then filters out the articles whose maximum similarity score is with themselves.
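
To illustrate the length effect described above, here is a small sketch with two made-up term-weight vectors (short_doc and long_doc are hypothetical and not taken from the Aggie corpus). It compares the raw dot product with cosine similarity, which divides by the vector lengths. This is an illustration of the idea, not the method used in this assignment.

```python
import math

def dot(a, b):
    # Raw dot product: grows with the magnitude of either vector.
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Dot product divided by both vector lengths (L2 norms).
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical term weights for a short and a long article.
short_doc = [1.0, 1.0, 0.0]
long_doc = [5.0, 5.0, 3.0]

raw_self = dot(short_doc, short_doc)
raw_cross = dot(short_doc, long_doc)
cos_self = cosine(short_doc, short_doc)
cos_cross = cosine(short_doc, long_doc)

print("raw:    self=%.1f cross=%.1f" % (raw_self, raw_cross))
print("cosine: self=%.3f cross=%.3f" % (cos_self, cos_cross))
```

With the raw dot product, the short article scores higher against the long article than against itself. With cosine similarity, an article's score with itself is always the maximum. In scikit-learn, leaving TfidfVectorizer's norm at its default of 'l2' would make tfs.dot(tfs.T) a cosine similarity matrix, though the scores and rankings reported below would then differ.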

In [12]:
# Imports needed by this cell:
from itertools import izip
from scipy.sparse import csr_matrix

# From Lesson 11:
stemmer = PorterStemmer().stem
tokenize = nltk.word_tokenize

def stem(tokens,stemmer = PorterStemmer().stem):
    return [stemmer(w.lower()) for w in tokens] 

def lemmatize(text):
    return stem(tokenize(text))

def that_one_time_i_made_a_coo(param):
    # Return (row, col, value) tuples sorted by row, then by value, both descending:
    tuples = izip(param.row, param.col, param.data)
    return sorted(tuples, key=lambda x: (x[0], x[2]), reverse = True)

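# Unnormalized TF-IDF (norm=None), so longer articles produce larger scores:
vectorizer = TfidfVectorizer(tokenizer=lemmatize,stop_words="english",smooth_idf=True,norm=None)
tfs = vectorizer.fit_transform(aggie_df["text"])
sim = tfs.dot(tfs.T) # pairwise dot products between all articles
sim_mx = csr_matrix(sim)
y = sim_mx.tocoo()
order_mx = that_one_time_i_made_a_coo(y)
# Take the highest-scoring entry for each article. The stride of 120 assumes
# each of the 120 rows stores all 120 entries (no zero similarities):
same_art = order_mx[0::120]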

# Keep only articles whose best match is a different article:
most_sim = [pair for pair in same_art if pair[0] != pair[1]]
        
sim_df = sorted(most_sim, key=lambda x: x[2], reverse = True)
sim_df
Out[12]:
[(35, 14, 11704.025711881699),
 (27, 16, 8714.2332118683698),
 (119, 16, 6673.8192827400189),
 (3, 16, 6130.199850693969),
 (59, 16, 4011.3847758507645),
 (5, 16, 3658.8637555851906),
 (0, 16, 3350.1272595908713),
 (7, 16, 2734.3243767737508),
 (38, 24, 2510.7583531326391),
 (32, 16, 1943.8385803597066),
 (23, 16, 1228.3985935757246),
 (107, 16, 1155.3021072203637),
 (82, 96, 666.87572397582858)]

The top 3 pairs of most similar articles are as follows:

In [13]:
print aggie_df["title"][35], "\n", aggie_df["title"][14] # most similar
UC Davis to host first ever mental health conference 
UC Davis holds first mental health conference
In [14]:
print aggie_df["title"][27], "\n", aggie_df["title"][16] # second most similar
Senator Bill Dodd visits UC Davis 
2017 ASUCD Winter Elections  Meet the Candidates
In [15]:
print aggie_df["title"][119], "\n", aggie_df["title"][16] # third most similar
Construction of the All Student Center at Davis High begins 
2017 ASUCD Winter Elections  Meet the Candidates

Analysis: A sample of the words each pair of articles has in common is shown below. In the first pair, as expected, words involving mental health, collaboration, and personal terminology are common, which agrees with the two titles. In the second and third pairs we see more community and political terminology, which is again consistent with the titles in both pairs.

In [16]:
mostsim1 = set.intersection(set(aggie_df["text"][35].split(" ")), set(aggie_df["text"][14].split(" ")))
mostsim2 = set.intersection(set(aggie_df["text"][27].split(" ")), set(aggie_df["text"][16].split(" ")))
mostsim3 = set.intersection(set(aggie_df["text"][119].split(" ")), set(aggie_df["text"][16].split(" ")))
sim_list = [mostsim1, mostsim2, mostsim3]
# Sets are unordered, so head(20) shows 20 arbitrary shared words, not the 20 most frequent:
sim_heads = [pd.DataFrame([s]).T.head(20) for s in sim_list]
sim_df = pd.concat(sim_heads, axis=1)
sim_df.columns = ['Pair 1', 'Pair 2', 'Pair 3']
sim_df
Out[16]:
    Pair 1     Pair 2         Pair 3
0   help       all            all
1   able       help           help
2   workshops  just           just
3   involved   over           its
4   its        policy,        before
5   Porter     through        opportunities
6   We         go             environment
7   suicide    its            to
8   based      really,        program
9   Chiang,    looking        has
10  personal   opportunities  which
11  Samantha   to             community
12  to         going          not
13  choose     input          during
14  then       2016           groups,
15  them       might          see
16  She        do             are
17  spiritual  it,            our
18  mental     his            project
19  people     finance        said
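
The table above is built with split(" "), so punctuation stays attached to some tokens (for example "policy," and "really,") and common function words take up many of the rows. The sketch below shows one alternative, assuming regex tokenization, lowercasing, and a small hypothetical stop-word set. The shared_terms helper and the two sample sentences are made up for illustration.

```python
import re

def shared_terms(text_a, text_b, stopwords=frozenset()):
    # Lowercase and keep only letters and apostrophes, so "policy," -> "policy".
    tokens_a = set(re.findall(r"[a-z']+", text_a.lower()))
    tokens_b = set(re.findall(r"[a-z']+", text_b.lower()))
    # Sort the result so the output is deterministic, unlike raw set order.
    return sorted((tokens_a & tokens_b) - set(stopwords))

# Hypothetical sentences standing in for two article texts.
text_a = "The policy, he said, will help students."
text_b = "Students asked about the policy; it will help."

shared = shared_terms(text_a, text_b, stopwords={"the", "will", "it", "he"})
print(shared)
```

In the notebook, the two arguments would be entries of aggie_df["text"], and the stop-word set could be the one already defined for the barplot, lowercased.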

Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.

Analysis: This corpus is not representative of the Aggie. Our sample is neither random nor sufficiently large (compared to the body of articles archived online) to support any meaningful inference about everything the Aggie has published. First, the Aggie publishes in many categories that we have not considered (arts, sports, and others beyond campus and city), so drawing inferences about the whole body of work from a small subset of the available categories is extremely short-sighted. Second, to draw inferences about all articles, we would need to sample randomly from all work ever published by the Aggie, including articles that were never published online. Third, the sample is a very small subset of that larger set. The combination of these factors makes this sample of 120 articles (60 from campus and 60 from city) an extremely poor representation of the Aggie's articles. Additionally, because the sample covers only the past couple of months, the terms are heavily biased toward the names and topics that have been in recent news cycles, rather than those from the full period in which the Aggie has been in publication.

This corpus can support inference within the time frame of the past few months (the range of the articles we scraped) and only within the campus and city categories. (Fun fact: Despite the great amount of work that I put into this assignment, I would not trust anything I did to make any meaningful inference at all, as much of this is a crude representation of what is actually true. I want you to know this as I enter the next stage of my life where I could start making decisions that affect people's lives.)